Machine learning reveals sex-specific associations between cardiovascular risk factors and incident atherosclerotic cardiovascular disease

We aimed to investigate sex-specific associations between cardiovascular risk factors and atherosclerotic cardiovascular disease (ASCVD) risk using machine learning. We studied 258,279 individuals (132,505 [51.3%] men and 125,774 [48.7%] women) without documented ASCVD who underwent national health screening. A random forest model was developed using 16 variables to predict the 10-year ASCVD in each sex. The association between cardiovascular risk factors and 10-year ASCVD probabilities was examined using partial dependency plots. During the 10-year follow-up, 12,319 (4.8%) individuals developed ASCVD, with a higher incidence in men than in women (5.3% vs. 4.2%, P < 0.001). The performance of the random forest model was similar to that of the pooled cohort equations (area under the receiver operating characteristic curve, men: 0.733 vs. 0.727; women: 0.769 vs. 0.762). Age and body mass index were the two most important predictors in the random forest model for both sexes. In partial dependency plots, advanced age and increased waist circumference were more strongly associated with higher probabilities of ASCVD in women. In contrast, ASCVD probabilities increased more steeply with higher total cholesterol and low-density lipoprotein (LDL) cholesterol levels in men. These sex-specific associations were verified in the conventional Cox analyses. In conclusion, there were significant sex differences in the association between cardiovascular risk factors and ASCVD events. While higher total cholesterol or LDL cholesterol levels were more strongly associated with the risk of ASCVD in men, older age and increased waist circumference were more strongly associated with the risk of ASCVD in women.

www.nature.com/scientificreports/
The global burden of atherosclerotic cardiovascular disease (ASCVD) is increasing 1 . Primary prevention, which includes the control of cardiovascular risk factors through lifestyle modifications or pharmacotherapy to prevent the first occurrence of ASCVD, is essential to minimize cardiovascular mortality and morbidity. Importantly, the target of these interventions is based on the individualized probabilities of ASCVD events 2,3 . Therefore, accurate risk prediction is important for identifying high-risk individuals to maximize the benefits of primary prevention. A body of evidence has shown significant sex differences in the prevalence of cardiovascular risk factors and the incidence of ASCVD 4-11 . In the general population, men tend to have a higher prevalence of obesity, smoking, high blood pressure (BP), diabetes mellitus, and dyslipidemia when compared with women 4,5 . Regarding cardiovascular outcomes, men are at a higher risk of ischemic heart disease and cardiovascular mortality than women 6-11 . The associations between cardiovascular risk factors and outcomes also differ by sex 6-9 . These findings emphasize the importance of sex-specific cardiovascular risk assessment and targeted primary prevention strategies.
Several studies have shown that machine learning models have a similar or higher performance for predicting ASCVD probabilities compared to established risk scoring systems, such as the pooled cohort equations (PCE) or Framingham Risk Score 12-16 . However, few studies have developed sex-specific machine learning models. In addition, previous studies have rarely provided information on how each variable is associated with the outcome in these models, information which could significantly improve model interpretability. Thus, constructing separate machine learning models by sex and delineating the impact of cardiovascular risk factors on outcomes in these models may provide a deeper insight into the importance of cardiovascular risk assessment by sex.
We aimed to investigate the sex differences in cardiovascular risk factors and their association with outcomes using nationwide health examination data with a machine learning approach. The aims of this study were: (1) to develop a sex-specific machine learning model for the prediction of ASCVD probabilities; (2) to stratify important predictors of ASCVD in each sex in the machine learning models; and (3) to investigate the sex-specific associations of these risk factors with ASCVD.

Methods
Cohort characteristics. The National Health Insurance Service (NHIS) in Korea covers the entire Korean population, and the NHIS database incorporates detailed information on the individuals' sociodemographics, medical check-up results including laboratory tests and health behaviors, healthcare utilization including diagnoses and treatments, and date and causes of death 17 . The representative sample of this database has been made publicly available for researchers, and its validity as a reliable data source has been established 18 .
Specifically, this study utilized the 'medical check-up sample cohort' of the NHIS database, which includes approximately 510,000 randomly sampled individuals (10%) aged 40 years or older from the general Korean population who underwent the standardized national medical check-up program in 2002 or 2003. These individuals were recommended to undergo repeated biannual medical check-ups up to 2013. Of these, we selected individuals who underwent a medical check-up in 2009 or 2010, because from 2009 the check-up measured not only total cholesterol but also its individual components, including low-density lipoprotein (LDL) cholesterol, high-density lipoprotein (HDL) cholesterol, and triglycerides. The date of the NHIS medical check-up in 2009/2010 was used as the index date for each individual. We excluded individuals with a previous history of cardiovascular diseases at the index date, including ischemic heart disease (International Classification of Diseases, Tenth Revision [ICD-10] codes I20-I25), heart failure (ICD-10 codes I50 and I420), stroke (ICD-10 codes I60-I69), and atrial fibrillation (ICD-10 code I48). Other exclusion criteria were chronic obstructive pulmonary disease, liver cirrhosis, end-stage renal disease, and cancer.
This study conformed to the Declaration of Helsinki and the Institutional Review Board approved the study protocol (Seoul National University Hospital, approval number: E-2104-087-1211). The need for informed consent was waived by the same ethics committee (Institutional Review Board of Seoul National University Hospital) as anonymized data were used.
Variable definitions. All clinical information was collected from the medical check-up conducted on the index date. Systolic and diastolic BP were measured after resting for at least 5 min. Data on smoking status, alcohol consumption, physical activity, and income levels were collected from structured self-administered questionnaires. Low income was defined as that within the lowest 30% of entire Korean residents. The lifetime amount of smoking was calculated as pack-years, and the mean alcohol consumption per day (g/day) was reported. The workload of daily physical activity was calculated as the metabolic equivalent of tasks (MET) minutes per week 19 , and the intensity of physical activity was categorized as low, moderate, and high intensity according to the International Physical Activity Questionnaire scoring protocol.
Past medical history of hypertension was defined as either (1) previous diagnostic codes for hypertension (ICD-10 codes I10-I13, I15) with prescription records of anti-hypertensive medications including angiotensin-converting enzyme inhibitors, angiotensin II receptor blockers, calcium channel blockers, thiazides, and beta-blockers, or (2) systolic/diastolic BP ≥ 140/90 mmHg measured at the medical check-up. A history of diabetes mellitus was defined by one of the following: (1) previous diagnostic codes for diabetes mellitus (ICD-10 codes E11-E14) accompanied by prescription records of glucose-lowering medications, or (2) fasting glucose level > 126 mg/dL at the medical check-up. Dyslipidemia was defined as either (1) diagnostic codes for dyslipidemia (ICD-10 code E78) with prescription records of lipid-lowering medications or (2)
The PCE method was used to calculate the probabilities of 10-year ASCVD according to guidelines, using the original beta coefficients provided by the guideline 20 .
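The hypertension and diabetes definitions above can be written as simple rules. The following is a minimal Python sketch, not the authors' code: the record layout and field names are hypothetical, and reading "≥ 140/90 mmHg" as "systolic ≥ 140 or diastolic ≥ 90" is an assumption. Dyslipidemia is omitted because its second criterion is not recoverable from the text.

```python
from dataclasses import dataclass


@dataclass
class CheckUp:
    """One participant's index-date record (field names are hypothetical)."""
    has_htn_code: bool      # ICD-10 I10-I13 or I15 on record
    on_htn_drug: bool       # anti-hypertensive prescription on record
    sbp: float              # systolic BP, mmHg
    dbp: float              # diastolic BP, mmHg
    has_dm_code: bool       # ICD-10 E11-E14 on record
    on_dm_drug: bool        # glucose-lowering prescription on record
    fasting_glucose: float  # mg/dL


def has_hypertension(r: CheckUp) -> bool:
    # Criterion 1: diagnostic code together with a prescription record.
    # Criterion 2: elevated BP at the check-up; "systolic >= 140 OR
    # diastolic >= 90" is an assumed reading of ">= 140/90 mmHg".
    return (r.has_htn_code and r.on_htn_drug) or r.sbp >= 140 or r.dbp >= 90


def has_diabetes(r: CheckUp) -> bool:
    # The strict ">" follows the threshold as written in the text.
    return (r.has_dm_code and r.on_dm_drug) or r.fasting_glucose > 126
```

Under these rules, a diagnostic code without a matching prescription record does not by itself count as a history of the condition.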
Outcome assessment. Individuals were followed up from the index date to December 31st, 2019, or death, whichever came first. The median follow-up duration of study participants was 10.1 years (interquartile range, 9.6-10.5 years). The primary endpoint was newly developed 10-year ASCVD events, defined as a composite of myocardial infarction, stroke, heart failure, and cardiovascular death. Myocardial infarction was defined as hospital admission with a diagnosis of non-ST-elevation or ST-elevation myocardial infarction (ICD-10 codes I21 and I22). Stroke events were defined as a hospital admission with a diagnosis of ischemic or hemorrhagic stroke (ICD-10 codes I60-I64), along with brain computed tomography or magnetic resonance imaging during hospitalization. Heart failure was defined as hospitalization for heart failure (ICD-10 codes I50 and I42). Cardiovascular death was defined as mortality attributed to cardiovascular causes (ICD-10 codes for death: I00-I99).
Random forest model. In this study, random forest models were developed to predict 10-year ASCVD probabilities. We chose the random forest model over other machine learning models because it can effectively handle high-dimensional non-linear data and has a reduced tendency to overfit, thereby generally yielding high prediction accuracy in large-scale clinical datasets 21,22 . Moreover, the random forest model provides an effective variable selection that estimates variable importance.
We included 16 variables in the random forest model development: age (years), body mass index (BMI) (kg/m²), waist circumference (cm), systolic BP (mmHg), diastolic BP (mmHg), smoking (pack-years), alcohol consumption (g/day), physical activity (MET minutes per week), fasting glucose (mg/dL), total cholesterol level (mg/dL), triglyceride (mg/dL), LDL cholesterol level (mg/dL), HDL cholesterol level (mg/dL), estimated glomerular filtration rate (mL/min/1.73 m²), AST (IU/L), and ALT (IU/L). The included variables were established risk factors for adverse cardiovascular events and variables routinely assessed for cardiovascular risk prediction 2,3 . We included only continuous variables since the random forest model may be biased in the assessment of relative variable importance by the variable type 23 . For categorical cardiovascular risk factors, we used the corresponding continuous variables (e.g., fasting glucose level instead of diabetes mellitus). In addition, including additional categorical variables in the model (e.g., proteinuria) did not significantly improve risk prediction.
The outcome for the random forest model was set as the 10-year ASCVD event. We constructed a separate random forest model for each sex, and each group of men and women was randomly divided into training (70%) and test (30%) sets, a commonly used division ratio in machine learning studies. Each decision tree was grown by choosing, at each node, a random set of candidate variables and splitting the samples into two branches so as to maximize the decrease in node impurity. The predicted probability was a numeric value that ranged from 0 to 1. The model performance was tested with different numbers of decision trees (ntree), minimum terminal node sizes (nodesize), and numbers of variables randomly sampled as candidates at each split (mtry) (Supplemental Table S1) 24 . The best-performing model with the highest area under the receiver operating characteristic curve (AUC) was selected as the final model. The performance of random forest models appeared robust to changes in the parameters: across parameter settings, the mean AUC was 0.721 (standard deviation 0.009) in men and 0.757 (standard deviation 0.013) in women. The calibration of random forest models was assessed to evaluate the relationship between predicted versus actual probabilities of 10-year ASCVD.
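The 70/30 split and the parameter search described above can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the study's R code: the function names are hypothetical, the candidate values passed to the grid are placeholders (the actual values are in Supplemental Table S1, which is not reproduced here), and model fitting itself is left out.

```python
import itertools
import random


def split_train_test(ids, train_fraction=0.7, seed=0):
    """Shuffle participant ids and split them into training and test sets."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def hyperparameter_grid(ntree_values, nodesize_values, mtry_values):
    """Enumerate every (ntree, nodesize, mtry) combination to evaluate."""
    return [
        {"ntree": n, "nodesize": s, "mtry": m}
        for n, s, m in itertools.product(ntree_values, nodesize_values, mtry_values)
    ]


def select_best(results):
    """Return the configuration with the highest AUC.

    `results` is a list of (config, auc) pairs, one per fitted model.
    """
    return max(results, key=lambda pair: pair[1])[0]
```

Because a separate model is built for each sex, the split would be applied once to the men and once to the women, each with its own search over the grid.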
Relative variable importance. We used the permutation variable importance to stratify the importance of predictors in the random forest model 25,26 . The difference in prediction error before and after randomly permuting each variable is calculated, which is averaged over all trees and normalized by the standard deviation. The resulting measure is reported as the mean decrease in accuracy. A greater mean decrease in accuracy indicates a higher level of variable importance in the respective random forest model.
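The idea behind permutation importance can be illustrated with a simplified, model-agnostic Python sketch. It is an assumption-laden approximation: the randomForest package computes this measure per tree on out-of-bag samples and normalizes by the standard deviation, whereas the hypothetical function below shuffles a column on one evaluation set and reports the raw mean drop in accuracy.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted class matches the label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)


def permutation_importance(predict, rows, labels, column, repeats=10, seed=0):
    """Mean drop in accuracy after shuffling one column across rows.

    `rows` is a list of dicts keyed by variable name; `predict` maps a row
    to a predicted class (0 or 1).
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        values = [row[column] for row in rows]
        rng.shuffle(values)
        # Copy each row with only the chosen column replaced.
        permuted = [dict(row, **{column: v}) for row, v in zip(rows, values)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / repeats
```

A variable the model never uses yields an importance of zero, because shuffling it cannot change any prediction.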

Partial dependency plot. For the top ten important variables, the relationship between the variables and 10-year ASCVD probabilities in the random forest model was visualized using the partial dependency plot, which is a useful tool to improve the model's interpretability 27 . The partial dependency plot is generated by calculating the marginal effect of a variable of interest on the outcome and integrating out the effects of all other variables 28,29 . The average probabilities of 10-year ASCVD were calculated at different values of a variable, which were traced using locally estimated scatterplot smoothing curves. Partial dependency plots were compared between men and women to investigate whether there were significant sex differences in these associations. To verify the associations between variables and ASCVD probabilities on the partial dependency plots, we further performed conventional Cox analysis, confirming these associations and assessing for any sex differences.
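The averaging step behind a partial dependency plot can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical `predict_proba` callable that maps one row (a dict of variable values) to a predicted probability; the smoothing step mentioned above is not included.

```python
def partial_dependence(predict_proba, rows, column, grid):
    """Average predicted probability with `column` forced to each grid value.

    For every grid value, the chosen variable is overwritten in all rows
    while the other variables keep their observed values, so their effects
    are averaged out over the sample.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)
            modified[column] = value
            total += predict_proba(modified)
        curve.append((value, total / len(rows)))
    return curve
```

Computing the curve separately from the men's and the women's models, over the same grid, gives the two lines that are compared in each panel.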
Statistical analysis. Continuous variables were presented as median values with interquartile ranges, and categorical variables were presented as frequencies with percentages. Differences between the groups were compared using the Kruskal-Wallis test for continuous variables and the chi-square test for categorical variables. The cumulative incidence of 10-year ASCVD and each component of the outcome were calculated using Kaplan-Meier estimates and compared between men and women using the log-rank test. The performance of the PCE-predicted ASCVD probabilities and random forest models was evaluated using AUC and compared using DeLong's method.
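For readers unfamiliar with the AUC, it equals the probability that a randomly chosen individual with an event receives a higher predicted probability than a randomly chosen individual without one. The Python sketch below computes it by direct pairwise comparison; it is illustrative only, runs in quadratic time, and does not implement DeLong's test for comparing two AUCs.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of events from non-events.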
The associations between the top ten important variables in the random forest model and the risk of 10-year ASCVD were examined using Cox proportional hazard analysis and were reported as hazard ratios (HRs) with 95% confidence intervals (CIs). In the Cox analysis, BMI was categorized into < 18.5, ≥ 18.5 to < 25, ≥ 25 to < 30, and ≥ 30 kg/m² based on the U-shaped relationship between BMI and 10-year ASCVD probabilities observed in the partial dependency plot. Multivariable Cox models were adjusted for variables included in the PCE, avoiding multicollinearity. The Cox proportionality assumption was evaluated using the scaled Schoenfeld residuals plots. The differences in risks between men and women were tested in Cox models using the interaction term. A two-tailed P < 0.05 was considered statistically significant. All analyses were performed using R version 3.3.0 (R Foundation for Statistical Computing, Vienna, Austria). The R package randomForest was used for model development, and the partial dependency plots were generated using the pdp package 29 .
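The BMI bands used in the Cox analysis can be expressed as a small helper. This Python function is a hypothetical illustration of the cut-points stated above; the label strings are arbitrary.

```python
def bmi_category(bmi):
    """Map BMI (kg/m^2) to the four bands used in the Cox analysis."""
    if bmi < 18.5:
        return "<18.5"
    if bmi < 25:
        return "18.5 to <25"
    if bmi < 30:
        return "25 to <30"
    return ">=30"
```

Categorizing BMI in this way lets the model estimate separate hazard ratios for the low and high ends of a U-shaped relationship, which a single linear term could not capture.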

Results
Baseline characteristics according to sex. Of the 258,279 individuals, 132,505 (51.3%) were men and 125,774 (48.7%) were women. Women were slightly older than men (56 vs. 55 years, P < 0.001) and had a lower BMI (Table 1). Men had a higher systolic and diastolic BP (both P < 0.001), with a higher prevalence of hypertension (35.5% vs. 33.6%, P < 0.001), but the use of antihypertensive medications was less frequent in men than in women (25.8% vs. 29.8%, P < 0.001). The proportion of current smokers and the amount of alcohol consumption were markedly higher in men than in women (current smokers: 31.2% vs. 1.5%, P < 0.001; alcohol consumption: 5.7 vs. 0.0 g per day, P < 0.001). Diabetes mellitus was also more prevalent in men than in women (12.6% vs. 8.6%, P < 0.001), with higher fasting glucose levels. On the other hand, women more frequently had dyslipidemia than men (26.6% vs. 17.9%, P < 0.001), and both total cholesterol and LDL cholesterol levels were significantly higher in women. The PCE-predicted 10-year ASCVD probabilities were 7.6% in men and 2.6% in women (P < 0.001).
Cardiovascular outcomes according to sex. During the 10-year follow-up period, 12,319 individuals (4.8%) developed ASCVD, and the annualized rate of ASCVD in the entire cohort was 4.96 cases per 1000 person-years. The events of 10-year ASCVD included 3413 myocardial infarctions (1.3%), 6951 heart failure events (2.7%), 1776 stroke events (0.7%), and 2115 cardiovascular deaths (0.8%) (Table 2). The incidence of ASCVD was significantly higher in men than in women (5.50 vs. 4.40 cases per 1000 person-years, P < 0.001). Men had a significantly higher incidence of myocardial infarction, heart failure, stroke, and cardiovascular death than women (all P < 0.050).
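Rates expressed per 1000 person-years divide the number of events by the total follow-up time accumulated across participants. The Python sketch below shows the arithmetic with hypothetical helper names; the numbers in the usage note are made up for illustration and are not study data.

```python
def person_years(follow_up_years):
    """Total follow-up time, summed over participants (in years)."""
    return sum(follow_up_years)


def rate_per_1000_person_years(events, total_person_years):
    """Incidence rate expressed as cases per 1000 person-years."""
    return 1000.0 * events / total_person_years
```

For example, 50 events observed over 10,000 person-years of follow-up correspond to 5.0 cases per 1000 person-years.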
Performance of random forest model according to sex. Figure 1 shows the performance of the random forest model and PCE-predicted ASCVD probabilities for the prediction of 10-year ASCVD. PCE showed fair predictability for 10-year ASCVD, with a higher AUC noted for women (men: AUC 0.727, 95% CI 0.715-0.738; women: AUC 0.762, 95% CI 0.750-0.774). Similarly, the random forest model achieved an AUC of 0.733 (95% CI 0.721-0.744) for men and a higher AUC of 0.769 (95% CI 0.757-0.782) for women. The performance of the PCE-predicted ASCVD probabilities and the random forest model was similar in both sexes (P-for-difference = 0.184 in men and 0.087 in women). Calibration plots of the random forest models are presented in Supplemental Fig. S1. The ASCVD probabilities predicted by the random forest were generally similar to the observed probabilities, although the random forest models tended to underestimate high ASCVD probabilities in both men and women.
Relative variable importance in random forest model according to sex. In both sexes, age and BMI were the two most important predictors in random forest models, with a mean decrease in accuracy of 95.4 and 72.2 in men and 86.5 and 74.3 in women (Fig. 2). Waist circumference, systolic BP, diastolic BP, total cholesterol, triglyceride, LDL cholesterol, AST, and ALT ranked third to tenth in importance in both sexes (Fig. 2). Smoking, drinking, physical activity, fasting glucose, HDL cholesterol, and estimated glomerular filtration rate were variables with lower importance in both sexes. The top ten variables were consistently ranked between first and tenth place when different hyperparameters were used in the random forest model (Supplemental Fig. S2).
Partial dependency plots of top ten important variables according to sex. Partial dependency plots, which show the adjusted relationship of variables with the outcome, were generated from the random forest models for men and women (Fig. 3). In the partial dependency plot, the probabilities of ASCVD increased with age, and this trend was more prominent in women than in men (Fig. 3a). There was a U-shaped relationship between BMI and ASCVD probabilities in both sexes (Fig. 3b). The ASCVD probabilities increased with a higher waist circumference, more strongly in women (Fig. 3c). For systolic BP, the ASCVD probabilities gradually increased beyond 140 mmHg in men, whereas they increased more steeply once the systolic BP exceeded approximately 170 mmHg in women (Fig. 3d). The probabilities of ASCVD gradually increased with higher diastolic BP in both sexes (Fig. 3e). Higher total cholesterol and LDL cholesterol levels were associated with increased probabilities of ASCVD more strongly in men, whereas the association between triglyceride and ASCVD appeared stronger in women (Fig. 3f-h). The increase in AST was more strongly associated with the ASCVD probabilities in men (Fig. 3i). ASCVD probabilities increased with higher ALT in both sexes (Fig. 3j).

Associations of cardiovascular risk factors with 10-year ASCVD risk according to sex. In the univariable Cox analysis, an increase in age, BMI < 18.5 kg/m² or ≥ 30 kg/m² compared to BMI ≥ 18.5 to < 25 kg/m², and increases in waist circumference, systolic/diastolic BP, total cholesterol, triglyceride, and AST were associated with higher ASCVD risk in both sexes (all P < 0.050) (Supplemental Table S2). However, an increase in LDL cholesterol or ALT level was associated with a higher ASCVD risk only in men.

Discussion
This study demonstrated significant sex differences in cardiovascular risk factors and their associations with ASCVD probabilities in the general population by applying machine learning to large-scale nationwide health examination data. Our random forest models had fair performance in predicting 10-year ASCVD, with a higher performance noted for women than for men. Importantly, the partial dependency plots demonstrated distinct sex-specific associations between these risk factors and 10-year ASCVD probabilities, which were also verified using Cox analysis. While the risk of ASCVD increased with higher total cholesterol and LDL cholesterol level more strongly in men, increased age and waist circumference were associated with higher ASCVD risk, especially in women (Fig. 4).
There are significant differences in the prevalence of cardiovascular risk factors according to sex. Men generally have more risk factors than women, including a higher prevalence of hypertension, smoking, and diabetes 4,5 , and this was also observed in our cohort. We also observed that the cumulative ASCVD events were significantly higher in men than in women, with more frequent events of myocardial infarction noted in men. Regarding pharmacologic treatment for ASCVD prevention, studies have shown that women may be less likely to be treated for dyslipidemia, whereas men may receive less antihypertensive treatment 4,5,30-32 . Patient perception of health status, patient-provider communications, and quality of life related to ASCVD may also be substantially different by sex 32 . Given these significant differences, sex-specific risk assessments and targeted prevention strategies are important to improve the outcomes of ASCVD.
While several previous studies have constructed machine learning models for predicting ASCVD, they rarely considered sex-specific models or evaluated sex-related differences using these models 12-16 . Interestingly, the performance of our random forest model was higher in women than in men (men: AUC 0.733 vs. women: AUC 0.769), and this was also observed for the PCE-predicted ASCVD probabilities (men: AUC 0.727 vs. women: AUC 0.762). This finding suggests there is a need to improve risk prediction specific to each sex. It may also imply more complex patterns and interactions between cardiovascular risk factors in men. Deep phenotyping or clustering individuals into similar but mutually exclusive subgroups may enable a more accurate prediction of ASCVD risk in men. The results also suggest that incorporating variables beyond the traditional cardiovascular risk factors may help to improve risk prediction for each sex. Genetic information, serum biomarkers, or socioeconomic factors, including income, education level, and relationship status, contribute to the development of ASCVD and have the potential to improve predictive ability 33,34 . For women, menopausal status or gestational diabetes are independent predictors of ASCVD 35 . Future studies are required to test these possibilities.
To enhance interpretability, we further investigated how each variable is associated with ASCVD probabilities using partial dependency plots. In partial dependency plots of age, we observed that the adverse effects of aging on the risk of ASCVD were stronger in women than in men. The cardio-protective role of estrogen may be one reason for the heightened risk associated with aging in women, especially after menopause. Estrogen plays an important role in the maintenance of cardiac structure and function by reducing oxidative stress, preserving endothelial function, and preventing the accumulation of myocardial fibrosis 36 . Studies have shown that cardiovascular mortality rates increase more steeply with age in women than in men, especially after age 45-64 years 37 . Importantly, the earlier onset of natural menopause is associated with a higher risk of ASCVD, further supporting the concept of female vulnerability related to estrogen withdrawal 35,38 . However, our data did not contain information on the menopausal state or estrogen level, and future studies are warranted to clarify the role of estrogen and menopause in the sex-specific association between age and risk of ASCVD.
We observed a U-shaped association between BMI and ASCVD probability in both sexes. Recent studies have demonstrated that underweight is a robust risk factor for adverse cardiovascular events, including heart failure, cardiovascular mortality, and all-cause mortality 39,40 . The exact mechanism underlying the association between underweight and cardiovascular events is not yet fully understood. However, it is plausible that underweight may be indicative of malnutrition status or sarcopenia, both of which have significant implications for cardiovascular health. Our findings indicate that underweight is a significant risk factor in both men and women, highlighting the importance of optimizing body weight for the prevention of ASCVD.
Our study provides comprehensive information on the sex differences in cardiovascular risk factors and their association with the risk of ASCVD, which supports the concept of targeted primary prevention interventions based on individualized risk assessment that accounts for sex differences. Our findings imply that more active lipid-lowering therapy may benefit men, whereas control of abdominal obesity may be more crucial in women. However, more conclusive data demonstrating the benefits of such targeted interventions are required, and future studies are warranted to test this hypothesis to improve cardiovascular outcomes in clinical practice.
Limitations. Our study has several limitations. First, although our analysis showed associations between cardiovascular risk factors and outcomes, it does not demonstrate a cause-and-effect relationship. Potential confounders across biological and sociodemographic factors (e.g., income status) may have influenced the study results, a limitation inherent to all observational cohort studies. Second, some risk factors, such as BP, may change during follow-up, and the change or variability of risk factors was not considered in our analysis. Third, the random forest model is not designed to account for time-to-event survival data. While the random survival forest algorithm has been shown to provide risk prediction using this type of data, we were unable to implement this algorithm due to its lengthy computational time with large sample sizes. Lastly, our study was conducted solely on individuals of Korean ethnicity, which may limit the generalizability of our results to other populations. Notably, smoking and alcohol consumption were markedly lower in women compared to men, which is consistent with previous research in Korea 41,42 . Therefore, further research is necessary to determine if our findings are applicable to other ethnic groups.

Conclusion
In conclusion, we developed a sex-specific machine learning model for predicting ASCVD events and investigated the associations between cardiovascular risk factors and ASCVD events in a large cohort from the general population using routine health examination data. While higher total cholesterol or LDL cholesterol levels were more strongly associated with the risk of ASCVD in men than in women, an increase in age and waist circumference were more strongly associated with the risk of ASCVD in women.

Data availability
The data created and/or used during this study are not publicly available according to the NHIS policy. Researchers can submit an application form through the NHIS website (https://nhiss.nhis.or.kr) to access and analyze the database.